CHE 1148 : Assignment-5¶

PART-2 LIME and SHAP interpretations¶

Importing relevant packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
In [2]:
import warnings
warnings.filterwarnings('ignore')
In [3]:
pip install lime
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: lime in /usr/local/lib/python3.9/dist-packages (0.2.0.1)
In [4]:
import lime
from lime import lime_tabular
In [5]:
#importing monthly table
df = pd.read_csv("/content/monthly_table.csv")
df
Out[5]:
CLNT_NO ME_DT mth_txn_amt_sum mth_txn_cnt amt_sum_3M amt_mean_3M amt_max_3M txn_cnt_sum_3M txn_cnt_mean_3M txn_cnt_max_3M ... cnt_Monday cnt_Saturday cnt_Sunday cnt_Thursday cnt_Tuesday cnt_Wednesday last_monthly_purchase days_since_last_txn customer_id response
0 CS1112 2011-05-31 72 1 0 0.000000 0 0 0.0 0 ... 0 0 0 0 0 0 0 0 CS1112 0
1 CS1112 2011-06-30 56 1 0 0.000000 0 0 0.0 0 ... 0 0 0 0 0 1 2011-06-15 00:00:00 15 CS1112 0
2 CS1112 2011-07-31 72 1 200 66.666667 72 3 1.0 1 ... 0 0 0 0 0 0 2011-06-15 00:00:00 46 CS1112 0
3 CS1112 2011-08-31 96 1 224 74.666667 96 3 1.0 1 ... 0 0 0 0 0 0 2011-08-19 00:00:00 12 CS1112 0
4 CS1112 2011-09-30 72 1 240 80.000000 96 3 1.0 1 ... 0 0 0 0 0 0 2011-08-19 00:00:00 42 CS1112 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
323543 CS9000 2014-11-30 72 1 216 72.000000 72 3 1.0 1 ... 0 0 0 0 0 0 2014-08-24 00:00:00 98 CS9000 0
323544 CS9000 2014-12-31 72 1 216 72.000000 72 3 1.0 1 ... 0 0 0 0 0 0 2014-08-24 00:00:00 129 CS9000 0
323545 CS9000 2015-01-31 72 1 216 72.000000 72 3 1.0 1 ... 0 0 0 0 0 0 2014-08-24 00:00:00 160 CS9000 0
323546 CS9000 2015-02-28 34 1 178 59.333333 72 3 1.0 1 ... 0 1 0 0 0 0 2015-02-28 00:00:00 0 CS9000 0
323547 CS9000 2015-03-31 72 1 178 59.333333 72 3 1.0 1 ... 0 0 0 0 0 0 2015-02-28 00:00:00 31 CS9000 0

323548 rows × 33 columns

In Feb-2014, clients CS1350 and CS1200 emailed my customer service department complaining about the company’s decision to market to them (or the lack of it). Hence, I will collect all the data for these customers up until January 2014.

In [6]:
#Filtering out data until January 2014

df_new = df[:] #new dataframe to filter out the required data
df_new = df_new[(df_new['ME_DT'] < '2014-01-31')]
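As an aside, the string comparison above works because ISO-formatted dates sort lexicographically; a minimal sketch (with toy dates, not the actual table) showing the more explicit datetime route and the `<` vs `<=` boundary behaviour:

```python
import pandas as pd

# Toy month-end dates (hypothetical values, for illustration only)
toy = pd.DataFrame({"ME_DT": ["2013-11-30", "2013-12-31", "2014-01-31", "2014-02-28"]})

# ISO-formatted date strings compare correctly as strings, but converting to
# datetime makes the intent explicit and guards against non-ISO formats.
toy["ME_DT"] = pd.to_datetime(toy["ME_DT"])
filtered = toy[toy["ME_DT"] <= "2014-01-31"]

print(filtered["ME_DT"].dt.strftime("%Y-%m-%d").tolist())
# With `<` instead of `<=`, the 2014-01-31 month-end row itself is dropped,
# which is what the filter above does: it keeps rows through December 2013.
```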
In [7]:
#Filtering out data for the clients of interest up until Jan 2014

clients = ["CS1350","CS1200"]
df_new = df_new[df_new["CLNT_NO"].isin(clients)]
df_new.reset_index(drop=True, inplace=True)
df_new
Out[7]:
CLNT_NO ME_DT mth_txn_amt_sum mth_txn_cnt amt_sum_3M amt_mean_3M amt_max_3M txn_cnt_sum_3M txn_cnt_mean_3M txn_cnt_max_3M ... cnt_Monday cnt_Saturday cnt_Sunday cnt_Thursday cnt_Tuesday cnt_Wednesday last_monthly_purchase days_since_last_txn customer_id response
0 CS1200 2011-05-31 72 1 216 72.000000 72 3 1.000000 1 ... 0 0 0 0 0 0 0 0 CS1200 0
1 CS1200 2011-06-30 94 1 238 79.333333 94 3 1.000000 1 ... 0 0 0 0 0 0 2011-06-03 00:00:00 27 CS1200 0
2 CS1200 2011-07-31 72 1 238 79.333333 94 3 1.000000 1 ... 0 0 0 0 0 0 2011-06-03 00:00:00 58 CS1200 0
3 CS1200 2011-08-31 72 1 238 79.333333 94 3 1.000000 1 ... 0 0 0 0 0 0 2011-06-03 00:00:00 89 CS1200 0
4 CS1200 2011-09-30 170 2 314 104.666667 170 4 1.333333 2 ... 0 1 0 0 0 0 2011-09-10 00:00:00 20 CS1200 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
59 CS1350 2013-08-31 84 1 231 77.000000 84 3 1.000000 1 ... 0 0 1 0 0 0 2013-08-04 00:00:00 27 CS1350 1
60 CS1350 2013-09-30 85 1 244 81.333333 85 3 1.000000 1 ... 0 0 1 0 0 0 2013-09-29 00:00:00 1 CS1350 1
61 CS1350 2013-10-31 120 2 289 96.333333 120 4 1.333333 2 ... 2 0 0 0 0 0 2013-10-14 00:00:00 17 CS1350 1
62 CS1350 2013-11-30 72 1 277 92.333333 120 4 1.333333 2 ... 0 0 0 0 0 0 2013-10-14 00:00:00 47 CS1350 1
63 CS1350 2013-12-31 72 1 264 88.000000 120 4 1.333333 2 ... 0 0 0 0 0 0 2013-10-14 00:00:00 78 CS1350 1

64 rows × 33 columns

Creating the train and test sets

In [8]:
#Train set for random forest model

X_train = df_new.drop(['CLNT_NO','ME_DT','response','customer_id','last_monthly_purchase'],axis=1)
y_train = df_new["response"]
In [9]:
#Test set for CS1200 and CS1350 separately

X_test_CS1200 = df_new[df_new["CLNT_NO"].isin(["CS1200"])]
X_test_CS1350 = df_new[df_new["CLNT_NO"].isin(["CS1350"])]
In [10]:
response_CS1200 = X_test_CS1200["response"]
response_CS1350 = X_test_CS1350["response"]
In [11]:
X_test_CS1200 = X_test_CS1200.drop(['CLNT_NO','ME_DT','response','customer_id','last_monthly_purchase'],axis=1)
X_test_CS1350 = X_test_CS1350.drop(['CLNT_NO','ME_DT','response','customer_id','last_monthly_purchase'],axis=1)

Training the model

The best random forest model that I used before had the following parameters:

  1. max_depth = 5
  2. ccp_alpha = 0.001
  3. class_weight = balanced
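For reference, `class_weight="balanced"` reweights each class inversely to its frequency, so the rare positive responses count for more during training. A small sketch with made-up label counts (not the real response distribution):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy response vector with the same kind of imbalance as the assignment data
# (the counts here are illustrative, not the actual label counts).
y = np.array([0] * 6 + [1] * 2)

# class_weight="balanced" assigns each class n_samples / (n_classes * n_class_samples)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # the minority class gets the larger weight
```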
In [12]:
#Random Forest Classifier
estimator_rf = RandomForestClassifier(random_state=42, max_depth=5, ccp_alpha=0.001, class_weight="balanced")

#Retraining the model
estimator_rf.fit(X_train, y_train)
Out[12]:
RandomForestClassifier(ccp_alpha=0.001, class_weight='balanced', max_depth=5,
                       random_state=42)

LIME MODELS¶

In [13]:
#defining an explainer object

explainer = lime_tabular.LimeTabularExplainer(
    training_data=np.array(X_train),
    feature_names=X_train.columns,
    class_names=['bad', 'good'],
    mode='classification'
)

LIME MODEL for CS1200¶

In [14]:
#original responses for CS1200
response_CS1200
Out[14]:
0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
30    0
31    0
Name: response, dtype: int64
In [15]:
#explaining the prediction for a randomly chosen CS1200 row

idx = np.random.randint(0, X_test_CS1200.shape[0])
exp = explainer.explain_instance(
    data_row=X_test_CS1200.iloc[idx],
    predict_fn=estimator_rf.predict_proba,
    num_features=28,
    top_labels=None,
    distance_metric='euclidean',
    num_samples=1000
)

exp.show_in_notebook(show_table=True)

Interpretation:

We see that the recorded response for this customer was negative, meaning he did not want promotional phone calls. Consistent with that, the LIME explanation shows the black-box model predicting a negative response with 97% probability.

Yet the customer still received a call. The explanation points to the negative influence on the model prediction of the features "amt_max_12M", "txn_cnt_sum_12M", "txn_cnt_max_12M" and "mth_txn_cnt".

Other plausible causes include model bias (quite possible given the imbalanced nature of the dataset), mislabelled responses, training on the wrong data, or simple human error in making the call or in adding the customer to the phone call list.

LIME MODEL for CS1350¶

In [16]:
#original responses for CS1350
response_CS1350
Out[16]:
32    1
33    1
34    1
35    1
36    1
37    1
38    1
39    1
40    1
41    1
42    1
43    1
44    1
45    1
46    1
47    1
48    1
49    1
50    1
51    1
52    1
53    1
54    1
55    1
56    1
57    1
58    1
59    1
60    1
61    1
62    1
63    1
Name: response, dtype: int64
In [17]:
#explaining the prediction for a randomly chosen CS1350 row

idx = np.random.randint(0, X_test_CS1350.shape[0])
exp = explainer.explain_instance(
    data_row=X_test_CS1350.iloc[idx],
    predict_fn=estimator_rf.predict_proba,
    num_features=28,
    top_labels=None,
    distance_metric='euclidean',
    num_samples=1000
)

exp.show_in_notebook(show_table=True)

Interpretation:

We see that the recorded response for this customer was positive, implying he was expecting, or had consented to, promotional calls. Consistent with that, the LIME explanation shows the black-box model predicting a positive response with 88% probability.

Yet the customer did not receive a call. Similar reasons apply here, given the negative influence on the model predictions of the features "amt_max_12M", "amt_mean_3M", "txn_cnt_max_3M", "txn_cnt_sum_6M" and "txn_cnt_max_6M".

The model could also be biased, the responses mislabelled, or a human error involved.

In general, LIME results are weighted by the proximity of the sampled instances to the instance of interest, so they are not always accurate or sufficient on their own for this kind of investigation. A more detailed investigation of the matter is needed.
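That proximity weighting can be sketched without the lime package: perturb around the instance, weight the perturbations with an exponential kernel on distance, and fit a weighted linear surrogate whose coefficients act as the local explanation. A minimal illustration with a toy black box (all functions and values here are hypothetical, not the random forest):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# A toy nonlinear "black box" standing in for the trained model
def black_box(X):
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

x0 = np.array([1.0, 0.5])  # instance of interest

# 1) Perturb around x0 (LIME samples from the training distribution;
#    plain Gaussian noise is a simplification here)
Z = x0 + rng.normal(scale=0.5, size=(1000, 2))

# 2) Weight samples by proximity to x0 with an exponential kernel
d = np.linalg.norm(Z - x0, axis=1)
kernel_width = 0.75
w = np.exp(-(d ** 2) / kernel_width ** 2)

# 3) Fit a weighted linear surrogate; its coefficients are the local explanation
surrogate = Ridge(alpha=1.0).fit(Z, black_box(Z), sample_weight=w)
print(surrogate.coef_)  # close to the local gradient [cos(1.0), 0.5], up to kernel smoothing
```

Nearby samples dominate the fit, which is exactly why the explanation is only locally faithful.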

SHAP Plot (BONUS)¶

In [18]:
pip install shap
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: shap in /usr/local/lib/python3.9/dist-packages (0.41.0)
In [19]:
import shap

We have already trained our random forest model on the data for these two clients. We'll create SHAP force plots for this model to explain its predictions.
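Before plotting, it may help to recall what SHAP values are: per-feature contributions that sum exactly to the difference between the model's output for an instance and a baseline. A brute-force Shapley computation on a toy 3-feature model (not the random forest) makes the additivity concrete:

```python
import numpy as np
from itertools import combinations
from math import factorial

# Hypothetical 3-feature model, chosen so features 1 and 2 interact
def f(x):
    return 2.0 * x[0] + 1.0 * x[1] * x[2]

baseline = np.array([0.0, 0.0, 0.0])  # stands in for the background mean
x = np.array([1.0, 1.0, 2.0])         # instance being explained

def value(S):
    """Model output with features in S taken from x, the rest from baseline."""
    z = baseline.copy()
    for i in S:
        z[i] = x[i]
    return f(z)

n = 3
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for k in range(n):
        for S in combinations(others, k):
            # Shapley weight |S|! (n - |S| - 1)! / n!
            wgt = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi[i] += wgt * (value(S + (i,)) - value(S))

print(phi, phi.sum(), f(x) - f(baseline))  # contributions sum to f(x) - f(baseline)
```

This exponential enumeration is what KernelExplainer approximates by sampling, and what TreeExplainer computes efficiently for tree ensembles.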

  • SHAP Plot for CS1200
In [21]:
#force plot using kernel explainer for CS1200
explainer = shap.KernelExplainer(estimator_rf.predict_proba, X_train)
shap.initjs()
shap_values = explainer.shap_values(X_test_CS1200.iloc[0,:])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test_CS1200.iloc[0,:])
Out[21]:
(SHAP force plot: interactive Javascript visualization not rendered in this static export.)
In [22]:
#force plot using tree explainer for CS1200
#plot for the first instance
tree_explainer = shap.TreeExplainer(estimator_rf)
shap.initjs()
#compute SHAP values for the instance being plotted, not the whole training set
shap_values = tree_explainer.shap_values(X_test_CS1200.iloc[[0]])
shap.force_plot(tree_explainer.expected_value[0], shap_values[0][0], X_test_CS1200.iloc[0,:])
Out[22]:
(SHAP force plot: interactive Javascript visualization not rendered in this static export.)

Interpretation

The SHAP plot using the kernel explainer explains why client CS1200 received the call. Even though the client responded negatively to promotional calls, the feature "amt_max_12M" had a large impact pushing the model prediction towards a positive response; the size of the impact can be read off the length of its bar. Values of this feature far above the base value are driving the prediction, so its influence would need to be reduced.

The second plot isolates the first instance, with "amt_max_12M = 99", and shows its impact on the model prediction.
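For context, the "base value" shown in a force plot is (approximately) the model's average prediction over the background data handed to the explainer; the per-feature SHAP bars then push the prediction away from that baseline. A quick sketch on synthetic data (not the assignment table):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary target

clf = RandomForestClassifier(random_state=0).fit(X, y)

# The force-plot base value for class 0 corresponds to the mean predicted
# probability of class 0 over the background dataset; per-feature SHAP
# contributions then move an individual prediction away from this baseline.
base_value = clf.predict_proba(X)[:, 0].mean()
print(round(base_value, 3))
```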

  • SHAP Plot for CS1350

In [23]:
#force plot using kernel explainer for CS1350
explainer = shap.KernelExplainer(estimator_rf.predict_proba, X_train)
shap.initjs()
shap_values = explainer.shap_values(X_test_CS1350.iloc[0,:])
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test_CS1350.iloc[0,:])
Out[23]:
(SHAP force plot: interactive Javascript visualization not rendered in this static export.)
In [24]:
#force plot using tree explainer for CS1350
tree_explainer_2 = shap.TreeExplainer(estimator_rf)
shap.initjs()
#compute SHAP values for the instance being plotted, not the whole training set
shap_values_2 = tree_explainer_2.shap_values(X_test_CS1350.iloc[[0]])
shap.force_plot(tree_explainer_2.expected_value[0], shap_values_2[0][0], X_test_CS1350.iloc[0,:])
Out[24]:
(SHAP force plot: interactive Javascript visualization not rendered in this static export.)

Interpretation

Similarly for client CS1350, the SHAP plot using the kernel explainer explains why he/she did not receive the call. Even though the client responded positively to promotional calls, the feature "amt_max_12M = 120" is above the expected value yet still has a negative impact on the model prediction. In addition, the feature "amt_mean_3M" is well below the base value and significantly impacts the prediction. The values of these two features would need to be re-examined to rectify the issue.

The second plot isolates the first instance, with "amt_max_12M = 120", and shows its impact on the model prediction.